Introduction to Open Data Science - Course Project

About the project

Welcome to my Open Data Science project page. Each chapter analyses a different (open) dataset with different multivariate statistical methods. See also my GitHub repository: https://github.com/paikalla/IODS-project


Linear regression and model validation

Dataset

# Setting up
library(car)
library(GGally)
library(ggplot2)

# Data
learning2014 <- read.csv("./data/learning/learning2014.csv", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.5)))

Based on the correlations of points with the other variables, we choose attitude, stra and surf as explanatory variables.

Fit a linear model for exam points

Let’s fit a linear model with attitude, stra and surf as explanatory variables.

my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

As shown in the model summary, the intercept (11.0171) is the estimated exam score when all explanatory variables are zero. According to the model, a one-point increase in general attitude towards statistics increases the expected exam points by about 3.4. The estimated effect is highly significant (p-value < 0.001): under the null hypothesis of no effect (β = 0), it would be very unlikely to obtain an estimate as large as 3.3952.
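To make the coefficient interpretation concrete, the fitted coefficients can be turned into a prediction by hand and compared with predict(). A minimal sketch, assuming my_model from the chunk above is in the workspace (the student profile is made up):

```r
# manual prediction from the coefficients vs. predict()
b <- coef(my_model)
new_obs <- data.frame(attitude = 4, stra = 3, surf = 2.5)  # hypothetical student
manual <- b["(Intercept)"] + b["attitude"] * 4 + b["stra"] * 3 + b["surf"] * 2.5
c(manual = unname(manual), predicted = unname(predict(my_model, newdata = new_obs)))
```

The two numbers agree because a linear model prediction is exactly this weighted sum of the inputs.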

As other variables are not statistically significant, let’s adjust our model by keeping only attitude as an explanatory variable:

fit2 <- lm(points ~ attitude, data = learning2014)
summary(fit2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the estimated effect of a one-point increase in attitude is an increase of about 3.5 exam points. The estimated effect is highly significant (p-value < 0.001). However, the R-squared is only 0.1906, which means the explanatory power of the model is low: attitude alone explains only about 19.1 % of the variation in exam points. The better the model fits the data, the closer the R-squared is to one.
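R-squared can be verified by hand from its definition, 1 − RSS/TSS; a sketch, assuming fit2 and learning2014 from the chunks above:

```r
y   <- learning2014$points
rss <- sum(residuals(fit2)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares around the mean
1 - rss / tss                   # same value as summary(fit2)$r.squared
```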

Validity of model assumptions

The following diagnostic plots help to evaluate the validity of the linear model assumptions. A linear regression model assumes

  1. errors (residuals) are normally distributed

  2. homoscedasticity: errors have mean zero and a constant variance

  3. linear relationship between response and explanatory variables

  4. independence of observations (an issue especially with time series, where observations are serially dependent)

  5. all relevant explanatory variables are included in the model

  6. no or little multicollinearity (check with vif(), the Variance Inflation Factor: a VIF of 1 indicates no multicollinearity for the variable, while a VIF above 5 or 10 indicates a problem)

  7. explanatory variables are uncorrelated with the error term
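Assumption 6 can be checked for the three-predictor model with vif() from the car package loaded in the setup chunk. A sketch, assuming my_model and learning2014 from above; the manual formula shows what the VIF measures:

```r
vif(my_model)
# VIF for one predictor by hand: regress it on the other predictors,
# then compute 1 / (1 - R^2)
r2_att <- summary(lm(attitude ~ stra + surf, data = learning2014))$r.squared
1 / (1 - r2_att)   # should match vif(my_model)["attitude"]
```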

par(mfrow=c(2,2))
plot(fit2, which=c(1,2,5))
hist(fit2$residuals) # normality 

Residuals vs Fitted values

This plot allows us to inspect the assumptions of zero mean and constant variance. The residuals are scattered around zero and their spread does not depend on the fitted values (although there are some large negative residuals at fitted values of about 24 or more). There is no pattern in the scatter plot, so our model seems to meet assumption 2, homoscedasticity.

Normal QQ-plot

The normal QQ plot inspects whether assumption 1, normally distributed errors, is met. If the residuals are normally distributed, they should lie on a straight line in the QQ plot. This is approximately the case, apart from some deviations at both ends of the distribution.

Residuals vs Leverage

The Residuals vs Leverage plot makes it possible to explore whether some observations have a high influence on the model. Data points with large residuals (outliers) and/or high leverage may distort the results and accuracy of a regression. Here, no observation seems to have an excessively large influence on the model fit.


library(tidyverse)
library(ggplot2)

Logistic regression

Dataset

alc_data <- read.csv("./data/student/pormath.csv", stringsAsFactors = TRUE)
dim(alc_data)
## [1] 370  51
colnames(alc_data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

What explains high alcohol consumption

Let's first look at the variables of interest graphically.

# keep only variables of interest
alc <- alc_data %>% select(high_use,sex,famrel,absences,G3, goout, alc_use)
summary(select(alc_data, G3, absences, famrel, sex, high_use, goout))
##        G3           absences          famrel      sex      high_use      
##  Min.   : 0.00   Min.   : 0.000   Min.   :1.000   F:195   Mode :logical  
##  1st Qu.:10.00   1st Qu.: 1.000   1st Qu.:4.000   M:175   FALSE:259      
##  Median :12.00   Median : 3.000   Median :4.000           TRUE :111      
##  Mean   :11.52   Mean   : 4.511   Mean   :3.935                          
##  3rd Qu.:14.00   3rd Qu.: 6.000   3rd Qu.:5.000                          
##  Max.   :18.00   Max.   :45.000   Max.   :5.000                          
##      goout      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :3.116  
##  3rd Qu.:4.000  
##  Max.   :5.000
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# or
p_g3 <- ggplot(data=alc_data,aes(x=high_use,y=G3,fill=high_use))
p_g3 + geom_boxplot() + ylab("Final grade")

p_going_out <- ggplot(data=alc_data,aes(x=goout,fill=high_use))
p_going_out + geom_bar() + ylab("Going out")

p_studytime <- ggplot(data=alc_data,aes(x=studytime,fill=high_use))
p_studytime + geom_bar()

p_internet <- ggplot(data=alc_data,aes(x=internet,fill=high_use))
p_internet + geom_bar()

Crosstables

table1 <- table(alc_data$sex, alc_data$high_use,dnn=c("sex","high_alc"))
addmargins(round(prop.table(table1)*100,1)) 
##      high_alc
## sex   FALSE  TRUE   Sum
##   F    41.6  11.1  52.7
##   M    28.4  18.9  47.3
##   Sum  70.0  30.0 100.0
table2 <- table(alc_data$famrel, alc_data$high_use,dnn=c("famrel","high_alc"))
addmargins(round(prop.table(table2)*100,1)) 
##       high_alc
## famrel FALSE TRUE  Sum
##    1     1.6  0.5  2.1
##    2     2.4  2.4  4.8
##    3    10.5  6.8 17.3
##    4    34.6 14.1 48.7
##    5    20.8  6.2 27.0
##    Sum  69.9 30.0 99.9
table(alc_data$goout, alc_data$high_use)
##    
##     FALSE TRUE
##   1    19    3
##   2    82   15
##   3    97   23
##   4    40   38
##   5    21   32
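The crosstables already suggest that high use is more common among males. A sketch computing the unadjusted odds ratio for sex directly from the 2x2 table (assumes alc_data from above):

```r
t_sex  <- table(alc_data$sex, alc_data$high_use)   # rows F/M, columns FALSE/TRUE
odds_f <- t_sex["F", "TRUE"] / t_sex["F", "FALSE"] # odds of high use, females
odds_m <- t_sex["M", "TRUE"] / t_sex["M", "FALSE"] # odds of high use, males
odds_m / odds_f                                    # unadjusted odds ratio for males
```

This unadjusted ratio can later be compared with the sexM odds ratio from the logistic model, which adjusts for the other explanatory variables.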

Boxplots

The box plots show the associations of the continuous variables grade and absences with alcohol consumption, drawn separately for females and males.

# initialize a plot of high_use and G3, grouping by sex
g1 <- ggplot(alc, aes(x = high_use, y = G3,col=sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") +
ggtitle("Student grades by alcohol consumption and sex")

Logistic regression model - what predicts high alcohol consumption

logistic <- glm(high_use ~  famrel + absences + G3 + goout + sex, data = alc, family = "binomial")

# print out a summary of the model
summary(logistic)
## 
## Call:
## glm(formula = high_use ~ famrel + absences + G3 + goout + sex, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7675  -0.7622  -0.4864   0.7441   2.6334  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.19616    0.83823  -2.620  0.00879 ** 
## famrel      -0.41131    0.14192  -2.898  0.00375 ** 
## absences     0.08020    0.02229   3.598  0.00032 ***
## G3          -0.04008    0.04048  -0.990  0.32213    
## goout        0.74913    0.12564   5.963 2.48e-09 ***
## sexM         1.07181    0.26467   4.050 5.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 364.33  on 364  degrees of freedom
## AIC: 376.33
## 
## Number of Fisher Scoring iterations: 4
better_logistic <- glm(high_use ~  famrel + absences  + goout + sex, data = alc, family = "binomial")

Odds ratios

# compute odds ratios (OR)
OR <- exp(coef(logistic))

# compute confidence intervals (CI)
CI <- logistic %>% confint %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1112294 0.02062473 0.5564523
## famrel      0.6627789 0.50005846 0.8740026
## absences    1.0835025 1.03796942 1.1344714
## G3          0.9607171 0.88730625 1.0403990
## goout       2.1151573 1.66463216 2.7274827
## sexM        2.9206466 1.75080525 4.9533931
OR2 <- exp(coef(better_logistic))

# compute confidence intervals (CI)
CI2 <- better_logistic %>% confint %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR2, CI2)
##                    OR2      2.5 %    97.5 %
## (Intercept) 0.06646973 0.01731071 0.2381125
## famrel      0.65884637 0.49727062 0.8681721
## absences    1.08565191 1.04010669 1.1366801
## goout       2.15919501 1.70320000 2.7776306
## sexM        2.94291184 1.76506498 4.9902818
#predict() the probability of high_use
probabilities <- predict(logistic, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65675676 0.04324324 0.70000000
##    TRUE  0.15675676 0.14324324 0.30000000
##    Sum   0.81351351 0.18648649 1.00000000

I conduct a simple model validation by tabulating the actual outcomes against the model predictions. The proportion of incorrectly predicted students is around 20 percent (15.7 % + 4.3 %). Thus, I would next search for additional explanatory variables!

Loss function:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2
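As a quick sanity check of the loss function (a toy example, not from the data): with three observations of which one prediction is on the wrong side of 0.5, the loss is 1/3.

```r
# TRUE/FALSE are coerced to 1/0, so abs(class - prob) > 0.5 flags wrong predictions
loss_func(class = c(TRUE, FALSE, TRUE), prob = c(0.9, 0.2, 0.3))  # 1/3
```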

Cross validation:

# K-fold cross-validation
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
cv <- cv.glm(data = alc, cost = loss_func, glmfit = logistic, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2054054
alc2 <- alc_data %>% select(G3, goout, studytime, internet, high_use, absences, health, nursery, romantic, sex)

logistic2 <- glm(high_use ~  internet + nursery + goout + sex, data = alc2, family = "binomial")
summary(logistic2)
## 
## Call:
## glm(formula = high_use ~ internet + nursery + goout + sex, family = "binomial", 
##     data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7162  -0.8349  -0.5965   0.8533   2.5304  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.5003     0.5761  -6.076 1.24e-09 ***
## internetyes  -0.0319     0.3549  -0.090 0.928387    
## nurseryyes   -0.3894     0.3055  -1.275 0.202418    
## goout         0.7618     0.1198   6.361 2.01e-10 ***
## sexM          0.9357     0.2519   3.715 0.000203 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 387.34  on 365  degrees of freedom
## AIC: 397.34
## 
## Number of Fisher Scoring iterations: 4
#predict() the probability of high_use
probabilities2 <- predict(logistic2, type = "response")

# add the predicted probabilities to 'alc'
alc2 <- mutate(alc2, probability = probabilities2)

# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)

loss_func(class = alc2$high_use, prob = alc2$probability)
## [1] 0.2189189
cv2 <- cv.glm(data = alc2, cost = loss_func, glmfit = logistic2, K = 5)

# average number of wrong predictions in the cross validation
cv2$delta[1]
## [1] 0.2432432

Clustering and classification

library(MASS)
library(GGally)  
library(ggplot2)
library(tidyverse)
library(corrplot)

Dataset

head(Boston, n = 4)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The variables are on very different scales.

# plot distributions and correlations of continuous variables 
p <- Boston %>% 
  ggpairs(
    mapping = aes(alpha=0.5), 
    lower = list(continuous = wrap("points", colour = "cornflower blue", size=0.3)), 
    upper = list(continuous = wrap("cor", size = 2)),
    diag = list(continuous = wrap("densityDiag", colour = "cornflower blue"))) +
 theme(axis.text.x = element_text(size=5), 
       axis.text.y = element_text(size=5))  
p

Only a few variables are even somewhat normally distributed (e.g. rm, the average number of rooms per dwelling). Most of the variables are skewed and/or bimodal. Let's also check the correlations.

cor_matrix <- round(cor(Boston),2)
corrplot(cor_matrix, tl.cex = 0.7)

There are some notable negative correlations involving dis (distance from the Boston employment centres), for example with age. Tax has a strong positive correlation with several variables, e.g. rad (accessibility to radial highways). Beyond these, it is difficult to say much about the data at this point, since there do not seem to be other clear patterns.
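The strongest pairs can also be pulled out programmatically; a sketch, assuming cor_matrix from the chunk above and using a 0.7 cutoff (an arbitrary choice):

```r
# variable pairs with |correlation| above 0.7, upper triangle only (no duplicates)
strong <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           cor  = cor_matrix[strong])
```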

Scale data

Since the variables are on such different scales (which is problematic for clustering and other distance-based methods), we need to scale the data:

boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # scale() returns a matrix, so convert it back to a data frame
summary(boston_sc)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
sd(boston_sc$dis)
## [1] 1

Scaling centred every variable at mean 0 and set its standard deviation to 1, which also made the variables resemble each other more: for example, tax ranged from 187.0 to 711.0, and after scaling it ranges from -1.3127 to 1.7964. This step makes sense when we are interested in the distances between the observations in this dataset.
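What scale() does can be verified by hand: each column is centred by its mean and divided by its standard deviation. A sketch, assuming Boston and boston_sc from above:

```r
# reproduce the scaled tax column manually
manual_tax <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)
all.equal(manual_tax, boston_sc$tax)  # TRUE
```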

Then we will create a categorical variable called crime which is based on the quantiles of crim variable:

brks <- quantile(boston_sc$crim)
lbls <- c("low","med_low","med_high","high")
crime <- cut(boston_sc$crim, breaks = brks, label = lbls, include.lowest = TRUE)
boston_sc$crime <- crime
boston_sc <- dplyr::select(boston_sc, -crim) # Remove the old crim variable
summary(boston_sc$crime)
##      low  med_low med_high     high 
##      127      126      126      127

Then let’s divide the data into training (80%) and testing sets (20%):

n <- nrow(boston_sc)
ind <- sample(n, size = n * 0.8)  # note: without set.seed() the split is not reproducible
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]

Now we have two sets: train_set contains a randomly chosen 80 % of the Boston observations, and test_set contains the remaining 20 %.

Linear Discriminant Model

Next I estimate a linear discriminant model with crime as the target variable. The purpose of the analysis is to identify those variables that explain whether a tract has a high or low crime rate. We have here a four-class grouping of crime rate.

A linear discriminant model assumes that the explanatory variables are continuous and normally distributed within the classes defined by the target variable. Moreover, the classes are assumed to share a common covariance matrix. According to the preliminary analysis, the normality assumption is questionable; I do not check normality within each crime class but simply assume it. Scaling puts the variables on comparable scales, although it does not by itself guarantee equal covariance matrices across the classes.

The results from linear discriminant analysis are shown below. The prior probabilities show the proportions of observations belonging to the four groups in the train data. They are not exactly equal because the grouping was done with all the 506 tracts. The variable means differ across crime groups suggesting that they have an association with the crime rate.

The first linear discriminant explains about 96 % of the between-group variance (proportion of trace 0.9624).

lda_fit  <- lda(crime ~ ., data = train_set)
lda_fit
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2450495 0.2599010 0.2450495 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.97037019 -0.9651881 -0.116404306 -0.9252246  0.47155981 -0.9146792
## med_low  -0.05976813 -0.2739222  0.006051757 -0.5773165 -0.09166025 -0.3374251
## med_high -0.38025152  0.1434859  0.140129052  0.3866161  0.04675566  0.4160315
## high     -0.48724019  1.0149946 -0.033716932  1.0734229 -0.44471256  0.8109429
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9562978 -0.6862262 -0.7209948 -0.46148873  0.37654866 -0.77532277
## med_low   0.3510420 -0.5561264 -0.4471322 -0.07417466  0.30919137 -0.17155481
## med_high -0.3782580 -0.4579515 -0.3647014 -0.29135684  0.08966411  0.05697214
## high     -0.8430128  1.6596029  1.5294129  0.80577843 -0.71377433  0.88890774
##                 medv
## low       0.55429260
## med_low   0.02026568
## med_high  0.12960118
## high     -0.70460017
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.13130839  0.519779886 -0.87969228
## indus    0.01841655 -0.258214756  0.44146395
## chas     0.00501796 -0.006971278  0.11580557
## nox      0.31927340 -0.831506191 -1.42187713
## rm       0.01336716 -0.033781878 -0.16078294
## age      0.20142313 -0.310804913 -0.07038329
## dis     -0.15396449 -0.034370496  0.03467908
## rad      4.05501718  0.778544224 -0.15543608
## tax     -0.01289706  0.344238361  0.59219477
## ptratio  0.22967422 -0.049501812 -0.28478152
## black   -0.11413656  0.046475067  0.12279414
## lstat    0.10078949 -0.192016962  0.18485136
## medv     0.01816022 -0.341646931 -0.23220188
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9624 0.0301 0.0075

The LDA biplot based on the estimated model is shown below. The observations are colored on the basis of their crime group. The arrows indicate that variable rad (index of accessibility to radial highways) is a strong predictor of linear discriminant 1, while variables nox (air pollution) and zn (prop. of residential land zoned for large lots) explain linear discriminant 2.

# the function for lda biplot arrows
lda_arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train_set$crime)
colnames(train_set)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"
# plot the lda results
plot(lda_fit, dimen = 2, col=classes, pch=classes)

lda_arrows(lda_fit, myscale = 2)

Model validation

I use the test data to validate the model, i.e. to see whether the observations in the test data are correctly classified.

# save the correct classes from test data
correct_classes <- test_set$crime

# remove the crime variable from test data
test <- dplyr::select(test_set, -crime)

The confusion matrix below shows that the model was quite good at classifying the cases from the test_set: 78/102 were classified correctly, i.e. the model placed about 76 % of the cases in the right "basket of crime rate", although in the case of med_low it only managed to get 15/29 right. Put differently, roughly a quarter of the observations lie outside the diagonal, i.e. are incorrectly predicted. I would consider this a relatively good success rate.

# predict classes with test data
lda_pred <- predict(lda_fit, newdata = test)

# confusion matrix with row and column sums
tab1 <- table(correct = correct_classes, predicted = lda_pred$class)
addmargins(tab1)

# accuracy: share of correctly classified cases
# (computed from the margin-free table; applying diag() and sum() to the
# addmargins() table would include the margins and distort the figure)
accuracy1 <- sum(diag(tab1)) / sum(tab1)
accuracy1

K-means clustering

As a final step, I run a K-means clustering analysis on the Boston data set. K-means divides the observations into a pre-defined number of clusters (K) by minimizing the distance of the observations to the cluster means (centroids). I first look at the distances between observations, using a popular distance measure, the Euclidean distance.

library(MASS)
data('Boston')

# remove variable chas 
#Boston <- Boston %>% dplyr::select(-chas)

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
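As a sanity check, an entry of the distance matrix is just the Euclidean distance between two scaled rows, computed by hand (assumes boston_scaled and dist_eu from above):

```r
# Euclidean distance between the first two rows, by hand
sqrt(sum((boston_scaled[1, ] - boston_scaled[2, ])^2))  # equals as.matrix(dist_eu)[1, 2]
```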

I first run K-means with two clusters (K=2). The plots below suggest that variables black and tax have a strong association with the clusters. Many of the other variables seem to have a somewhat weaker effect, too.

# k-means clustering on the scaled data (the cluster plots below use boston_scaled)
km <- kmeans(boston_scaled, centers = 2)

summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       28    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
# plot the Boston dataset with clusters
pairs(boston_scaled[1:10], col = km$cluster)

pairs(boston_scaled[4:6], col = km$cluster)

pairs(boston_scaled[7:9], col = km$cluster)

pairs(boston_scaled[10:12], col = km$cluster)

pairs(boston_scaled[13:14], col = km$cluster)

I search for a good number of clusters K by inspecting how the total within-cluster sum of squares (total WCSS) changes as K grows, letting K run from 1 to 10. A good choice for K is a value at which the total WCSS drops sharply.

The plot below shows that the optimal number of clusters is 2.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

I choose K=2 and re-run the K-means clustering algorithm. The plot below supports dividing this dataset into two clusters. (twcss = total within-cluster sum of squares.)

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with 2 clusters
pairs(boston_scaled[1:6], col = km$cluster)

pairs(boston_scaled[7:14], col = km$cluster)

Bonus

data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda, dimen=2, col=boston_again$cluster)
lda_arrows(boston_lda, myscale=2)

It seems that the most influential individual variable is black. Other variables with a relatively strong influence are crim, indus, nox and tax. Variables black and crim seem to "pull" in their own directions, while most of the variables form a group that pulls to the left. The fact that black is influential confirms earlier observations (e.g. part 7 of this exercise).

Super-bonus

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
summary(model_predictors)
##        zn                indus               chas                nox           
##  Min.   :-0.487240   Min.   :-1.55630   Min.   :-0.272329   Min.   :-1.464433  
##  1st Qu.:-0.487240   1st Qu.:-0.90255   1st Qu.:-0.272329   1st Qu.:-0.920756  
##  Median :-0.487240   Median :-0.37560   Median :-0.272329   Median :-0.195854  
##  Mean   : 0.009721   Mean   :-0.02241   Mean   : 0.000539   Mean   :-0.009254  
##  3rd Qu.: 0.289908   3rd Qu.: 1.01499   3rd Qu.:-0.272329   3rd Qu.: 0.658496  
##  Max.   : 3.800473   Max.   : 2.42017   Max.   : 3.664771   Max.   : 2.729645  
##        rm                 age                 dis                rad          
##  Min.   :-3.876413   Min.   :-2.222999   Min.   :-1.26230   Min.   :-0.98187  
##  1st Qu.:-0.564866   1st Qu.:-0.810864   1st Qu.:-0.80209   1st Qu.:-0.63733  
##  Median :-0.081317   Median : 0.317068   Median :-0.22678   Median :-0.52248  
##  Mean   :-0.001396   Mean   :-0.004508   Mean   : 0.02021   Mean   :-0.02017  
##  3rd Qu.: 0.472328   3rd Qu.: 0.914783   3rd Qu.: 0.70867   3rd Qu.: 1.65960  
##  Max.   : 3.473251   Max.   : 1.116390   Max.   : 3.95660   Max.   : 1.65960  
##       tax               ptratio             black             lstat          
##  Min.   :-1.312691   Min.   :-2.70470   Min.   :-3.9033   Min.   :-1.503007  
##  1st Qu.:-0.756434   1st Qu.:-0.49910   1st Qu.: 0.2132   1st Qu.:-0.827687  
##  Median :-0.464213   Median : 0.13602   Median : 0.3797   Median :-0.173372  
##  Mean   :-0.009823   Mean   :-0.01182   Mean   : 0.0183   Mean   :-0.003237  
##  3rd Qu.: 1.529413   3rd Qu.: 0.80578   3rd Qu.: 0.4331   3rd Qu.: 0.607674  
##  Max.   : 1.796416   Max.   : 1.63721   Max.   : 0.4406   Max.   : 3.545262  
##       medv          
##  Min.   :-1.906340  
##  1st Qu.:-0.626046  
##  Median :-0.144916  
##  Mean   : 0.004561  
##  3rd Qu.: 0.268258  
##  Max.   : 2.986505
summary(lda_fit$scaling)
##       LD1                 LD2                LD3          
##  Min.   :-0.153964   Min.   :-0.83151   Min.   :-1.42188  
##  1st Qu.: 0.005018   1st Qu.:-0.25821   1st Qu.:-0.23220  
##  Median : 0.018417   Median :-0.03437   Median :-0.07038  
##  Mean   : 0.370112   Mean   :-0.02844   Mean   :-0.13180  
##  3rd Qu.: 0.201423   3rd Qu.: 0.04648   3rd Qu.: 0.12279  
##  Max.   : 4.055017   Max.   : 0.77854   Max.   : 0.59219
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)

summary(matrix_product)
##       LD1                LD2                 LD3           
##  Min.   :-5.17115   Min.   :-3.647671   Min.   :-3.271755  
##  1st Qu.:-3.15326   1st Qu.:-0.626416   1st Qu.:-0.533163  
##  Median :-2.21145   Median :-0.042644   Median : 0.120865  
##  Mean   :-0.09284   Mean   : 0.000695   Mean   :-0.002707  
##  3rd Qu.: 6.53189   3rd Qu.: 0.722263   3rd Qu.: 0.587911  
##  Max.   : 8.22523   Max.   : 3.102531   Max.   : 2.666062
# install.packages("plotly")
library(plotly)

plot_ly(x=matrix_product$LD1,  y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=classes, size=I(40))
train_km <- kmeans(train_set[,-14],centers=4)
cluster_col <- train_km$cluster

plot_ly(x=matrix_product$LD1,  y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col, size=I(40))

The scatter of points in these two plots resembles each other relatively well. Based on them, I would say that the crime variable plays a relatively large role when the algorithm defines the clusters: the cluster colors are distributed quite similarly to the crime-rate classes.


Principal Component Analysis